Binary Neural Networks Algorithms, Architectures, and Applications (Baochang Zhang, Sheng Xu, Mingbao Lin etc.)

Binary Neural Architecture Search

The value of selecting an operation, $\tilde{r}_k$, is the expected reward $\mathrm{reward}_i$ we receive when we take an operation from the possible set of operations. If $n_k$ approaches infinity, $\tilde{r}_k$ approaches the actual value of the operation $r_k$. However, the number of operations $n_k$ cannot be infinite. Therefore, we should approximate the actual value as closely as possible through the variance.

Definition 2. There exists a difference between the estimated probability $\tilde{r}_k$ and the actual probability $r_k$, and we can estimate the variance concerning the value as

$$\tilde{\delta}_k = \sqrt{\frac{2 \ln N}{n_k}}, \tag{4.3}$$

where $N$ is the total number of trials.
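
As a minimal sketch of Definition 2, the snippet below evaluates the confidence radius $\sqrt{2 \ln N / n_k}$ for one operation. The function name `confidence_radius` and its argument names are illustrative assumptions, not names used in the text.

```python
import math


def confidence_radius(total_trials: int, times_selected: int) -> float:
    """Return sqrt(2 ln N / n_k), the radius from Definition 2.

    total_trials (N) and times_selected (n_k) are hypothetical argument names.
    """
    if total_trials < 1 or times_selected < 1:
        raise ValueError("both counts must be positive")
    return math.sqrt(2.0 * math.log(total_trials) / times_selected)


# For a fixed N, the radius shrinks as the same operation is selected more often.
radii = [confidence_radius(total_trials=100, times_selected=n) for n in (1, 10, 100)]
print(radii)
```

For a fixed number of trials the radius falls as $1/\sqrt{n_k}$, which is the sense in which the estimate tightens with more selections.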

Proof. Suppose $X \in [0, 1]$ represents the theoretical value of each independently distributed operation, $n$ is the number of times the arm has been played up to the current trial, and $p_i$ is the actual value of the operation in the $i$-th trial. Furthermore, we define $p = \sum_i p_i$ and $q = 1 - p$.

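Since the variance boundary of independent operations can represent the global variance boundary (see the Appendix), Markov's inequality gives

$$
\begin{aligned}
P[X > p + \delta] &= P\Big[\sum_i (X_i - p_i) > \delta\Big] \\
&= P\Big[e^{\lambda \sum_i (X_i - p_i)} > e^{\lambda\delta}\Big] \\
&\le \frac{E\big[e^{\lambda \sum_i (X_i - p_i)}\big]}{e^{\lambda\delta}}.
\end{aligned}
\tag{4.4}
$$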

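Since $1 + x \le e^x \le 1 + x + x^2$ when $0 \le |x| \le 1$, the term $E\big[e^{\lambda \sum_i (X_i - p_i)}\big]$ in Eq. 4.4 can be further bounded as follows:

$$
\begin{aligned}
E\big[e^{\lambda \sum_i (X_i - p_i)}\big] &= \prod_i E\big[e^{\lambda (X_i - p_i)}\big] \\
&\le \prod_i E\big[1 + \lambda (X_i - p_i) + \lambda^2 (X_i - p_i)^2\big] \\
&= \prod_i \big(1 + \lambda^2 v_i^2\big) \\
&\le e^{\lambda^2 v^2},
\end{aligned}
\tag{4.5}
$$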
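
The two steps of Eq. 4.5 that are easiest to miss can be spelled out as follows. This is a sketch that assumes $E[X_i] = p_i$, writes $v_i^2$ for the variance of $X_i$, and takes $v^2 = \sum_i v_i^2$; these conventions are inferred from the surrounding derivation rather than stated explicitly in the text.

```latex
\begin{aligned}
E\big[1 + \lambda (X_i - p_i) + \lambda^2 (X_i - p_i)^2\big]
  &= 1 + \lambda\,\underbrace{E[X_i - p_i]}_{=\,0}
       + \lambda^2\,\underbrace{E\big[(X_i - p_i)^2\big]}_{=\,v_i^2}
   = 1 + \lambda^2 v_i^2, \\
\prod_i \big(1 + \lambda^2 v_i^2\big)
  &\le \prod_i e^{\lambda^2 v_i^2}
   = e^{\lambda^2 \sum_i v_i^2}
   = e^{\lambda^2 v^2}.
\end{aligned}
```

The first line uses linearity of expectation, and the second applies $1 + x \le e^x$ to each factor.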

where $v^2$ denotes the variance of $X$. Combining Eq. 4.4 and Eq. 4.5 gives $P[X > p + \delta] \le e^{\lambda^2 v^2} / e^{\lambda\delta}$. Since $\lambda$ is a positive constant, a suitable choice of its value transforms this bound into $P[X > p + \delta] \le e^{-2n\delta^2}$. According to the symmetry of the distribution, we have $P[X < p - \delta] \le e^{-2n\delta^2}$. Finally, we get the following inequality:

$$P\big[|X - p| \le \delta\big] \ge 1 - 2e^{-2n\delta^2}. \tag{4.6}$$
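
The bound $1 - 2e^{-2n\delta^2}$ can be checked numerically. The sketch below assumes, for illustration only, that $X$ is the average of $n$ independent Bernoulli($p$) rewards, the setting in which this bound is the standard Hoeffding inequality. The function name, the seed, and the parameter values are arbitrary choices and do not come from the text.

```python
import math
import random


def empirical_coverage(p: float, n: int, delta: float, runs: int, seed: int = 0) -> float:
    """Fraction of runs in which the mean of n Bernoulli(p) draws is within delta of p."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(runs):
        mean = sum(1 if rng.random() < p else 0 for _ in range(n)) / n
        if abs(mean - p) <= delta:
            hits += 1
    return hits / runs


p, n, delta = 0.5, 50, 0.2
# Right-hand side of the inequality: 1 - 2 exp(-2 n delta^2)
lower_bound = 1.0 - 2.0 * math.exp(-2.0 * n * delta ** 2)
coverage = empirical_coverage(p, n, delta, runs=2000)
print(f"empirical coverage {coverage:.3f}, lower bound {lower_bound:.3f}")
```

The empirical coverage should sit above the lower bound, since the inequality is a worst-case guarantee rather than an exact probability.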

We need to decrease $\delta$ as an operation is recommended more often. Therefore, we choose $\sqrt{2 \ln N / n}$ as $\tilde{\delta}$. That is to say, $p - \sqrt{2 \ln N / n} \le X \le p + \sqrt{2 \ln N / n}$ holds with probability at least $1 - 2/N^4$, because substituting $\delta^2 = 2 \ln N / n$ into Eq. 4.6 gives $2e^{-4 \ln N} = 2/N^4$. The variance value will gradually decrease as the trials progress, and $\tilde{r}_k$ will gradually approach $r_k$. Equation 4.7 shows that we can achieve a probability of 0.992 when the number of trials reaches 4.

$$
1 - \frac{2}{N^4} =
\begin{cases}
0.875, & N = 2 \\
0.975, & N = 3 \\
0.992, & N = 4.
\end{cases}
\tag{4.7}
$$
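
The values of $1 - 2/N^4$ for small $N$ are easy to reproduce. The following sketch simply evaluates the expression; the helper name `coverage_lower_bound` is an illustrative assumption.

```python
def coverage_lower_bound(total_trials: int) -> float:
    """Evaluate 1 - 2 / N**4 for a given number of trials N."""
    return 1.0 - 2.0 / total_trials ** 4


for n_trials in (2, 3, 4):
    print(n_trials, round(coverage_lower_bound(n_trials), 3))
# N=2 -> 0.875, N=3 -> 0.975, N=4 -> 0.992
```

Because the bound approaches 1 at a rate of $N^{-4}$, only a handful of trials is needed before the confidence interval holds with high probability.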